[superlog] Downgrade recoverable AI tool errors from ERROR to WARN in job-level wide event #483
Found 7 test failures on Blacksmith runners: Failures
Greptile Summary
This PR fixes false-positive Superlog incidents caused by recoverable AI tool failures (e.g. a failed ClickHouse SQL query or a NOT_FOUND annotation response) escalating the job-level wide event to ERROR even when the job ultimately succeeded. The fix is a one-line change in
Confidence Score: 4/5
Safe to merge for the stated goal; the job-failure path is untouched and genuine incidents will still fire at ERROR. The change is small, well-reasoned, and correctly targets the evlog escalation mechanism. The main open question is whether blanket-downgrading every tool
The single changed file
Important Files Changed
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant Tool as AI Tool (e.g. annotations.ts)
participant TL as createToolLogger
participant RL as requestLogger (evlog wide event)
participant JR as jobs.ts (job runner)
Note over JR,RL: Job starts — wide event created
JR->>RL: createInsightsEventLog(context)
Tool->>TL: logger.error("Failed to create annotation", ctx)
alt BEFORE this PR
TL->>RL: "requestLogger.error(new Error(msg), {aiTool, ...ctx})"
Note over RL: hasError = true — wide event escalated to ERROR
JR->>RL: "logger.emit({ job_status: "succeeded" })"
Note over RL: emits at ERROR — false-positive incident
else AFTER this PR
TL->>RL: "requestLogger.warn(message, {aiTool, ...ctx})"
Note over RL: hasError unchanged — wide event stays at INFO/WARN
JR->>RL: "logger.emit({ job_status: "succeeded" })"
Note over RL: emits at correct level — no false-positive
end
Note over JR,RL: Genuine failure path (unchanged)
JR->>RL: logger.error(err)
JR->>RL: "logger.emit({ job_status: "failed" })"
Note over RL: emits at ERROR — real incident fires correctly
Summary
Insights generation jobs that encounter a recoverable AI tool failure (e.g. a ClickHouse SQL query error, an annotation creation returning NOT_FOUND) emit their final job-completion log at ERROR severity even when job_status: "succeeded". This creates false-positive Superlog incidents on every scheduled batch run.
Root cause
createToolLogger.error() in packages/ai/src/ai/tools/utils/logger.ts calls requestLogger.error(err, …), which sets evlog's internal hasError = true on the job-scoped wide event. When the job runner calls logger.emit({ job_status: "succeeded" }) at the end of a successful run, evlog's emit() fires at ERROR level because hasError is still set, even though the AI agent recovered from the tool failure and the job completed successfully.
Fix
Change the request-scoped path in createToolLogger.error() from requestLogger.error() to requestLogger.warn(). All error context (tool name, SQL snippet, error message) is still merged into the wide event at WARN severity. Genuine job failures continue to log at ERROR: apps/insights/src/jobs.ts calls logger.error(err) directly on the job-failure path, which is unaffected by this change. The log.error(…) fallback (used outside a job scope) is also unchanged.
Alternative approach: move the level-escalation guard into the job runner itself (only emit at ERROR when job_status: "failed"), but this would require changes to jobs.ts for every call site and doesn't fix the same pattern if it appears in other services using createToolLogger.
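To make the escalation mechanism concrete, here is a minimal toy model of a wide event with a sticky error flag. It is only a sketch of the behavior described in this PR, not evlog's actual implementation: the class ToyWideEvent, its hasWarn flag, and the "annotations" tool name are hypothetical, and the assumption that a warn() call makes the final event emit at WARN (rather than INFO) is mine.

```typescript
// Toy model only. This is NOT evlog's API; names and fields are hypothetical
// and mirror the behavior described in the PR text.
type Level = "info" | "warn" | "error";
type Fields = Record<string, unknown>;

class ToyWideEvent {
  private fields: Fields = {};
  private hasError = false; // sticky: once set, it is never cleared
  private hasWarn = false; // assumption: warnings raise the level to "warn"

  error(err: Error, ctx: Fields = {}): void {
    this.hasError = true;
    Object.assign(this.fields, ctx, { error: err.message });
  }

  warn(message: string, ctx: Fields = {}): void {
    this.hasWarn = true;
    Object.assign(this.fields, ctx, { warning: message });
  }

  // The level comes from the sticky flags, not from the fields passed here,
  // so job_status: "succeeded" cannot undo an earlier error() call.
  emit(extra: Fields = {}): { level: Level; fields: Fields } {
    const level: Level = this.hasError ? "error" : this.hasWarn ? "warn" : "info";
    return { level, fields: { ...this.fields, ...extra } };
  }
}

// Before the fix: a recoverable tool failure is recorded with error().
const beforeRun = new ToyWideEvent();
beforeRun.error(new Error("Failed to create annotation"), { aiTool: "annotations" });
const beforeEvent = beforeRun.emit({ job_status: "succeeded" });

// After the fix: the same failure is recorded with warn().
const afterRun = new ToyWideEvent();
afterRun.warn("Failed to create annotation", { aiTool: "annotations" });
const afterEvent = afterRun.emit({ job_status: "succeeded" });

console.log(beforeEvent.level, afterEvent.level); // prints "error warn"
```

In both runs the tool context is still merged into the final event; only the emitted level differs, which is the property the fix relies on.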
Summary by cubic
Downgraded recoverable tool errors from ERROR to WARN in job-scoped events to prevent false-positive incidents when a job ultimately succeeds. This keeps successful insights runs from emitting a final ERROR log.
In packages/ai/src/ai/tools/utils/logger.ts, updated createToolLogger.error() to use requestLogger.warn(message, …) instead of requestLogger.error(err, …) so recoverable tool failures don’t escalate the wide event; real job failures still log at ERROR via the job runner.
Written for commit 0487ce1. Summary will update on new commits.
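The two paths through createToolLogger.error() mentioned in this PR (the request-scoped path that now warns, and the fallback used outside a job scope) could be sketched roughly as follows. The real function's signature is not shown on this page, so createToolLoggerSketch, its parameters, and the interface names below are assumptions made purely for illustration.

```typescript
// Hypothetical shapes; the real createToolLogger signature is not shown here.
type Ctx = Record<string, unknown>;

interface RequestLoggerLike {
  warn(message: string, ctx?: Ctx): void;
  error(err: Error, ctx?: Ctx): void;
}

interface FallbackLogLike {
  error(message: string, ctx?: Ctx): void;
}

function createToolLoggerSketch(
  aiTool: string,
  fallbackLog: FallbackLogLike,
  requestLogger?: RequestLoggerLike,
) {
  return {
    error(message: string, ctx: Ctx = {}): void {
      if (requestLogger) {
        // Job-scoped path: record the failure as a warning so the wide
        // event is not escalated by a recoverable tool error.
        requestLogger.warn(message, { aiTool, ...ctx });
        return;
      }
      // Outside a job scope: unchanged, still logged at error level.
      fallbackLog.error(message, { aiTool, ...ctx });
    },
  };
}

// Usage with recording stubs in place of real loggers.
const calls: string[] = [];
const requestStub: RequestLoggerLike = {
  warn: (m) => { calls.push(`request.warn:${m}`); },
  error: (e) => { calls.push(`request.error:${e.message}`); },
};
const fallbackStub: FallbackLogLike = {
  error: (m) => { calls.push(`log.error:${m}`); },
};

createToolLoggerSketch("annotations", fallbackStub, requestStub).error("Failed to create annotation");
createToolLoggerSketch("annotations", fallbackStub).error("Failed to create annotation");
console.log(calls);
```

With a request logger present the failure goes through warn(); without one it falls back to the error-level log, matching the PR's statement that the fallback path is unchanged.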